{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MarkdownHeaderTextSplitter\n",
    "\n",
    "- Author: [HeeWung Song(Dan)](https://github.com/kofsitho87)\n",
    "- Peer Review : [BokyungisaGod](https://github.com/BokyungisaGod), [Chaeyoon Kim](https://github.com/chaeyoonyunakim)\n",
    "- Proofread : [Chaeyoon Kim](https://github.com/chaeyoonyunakim)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/06-MarkdownHeaderTextSplitter.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/06-MarkdownHeaderTextSplitter.ipynb)\n",
    "## Overview\n",
    "\n",
    "This tutorial introduces how to effectively split **Markdown documents** using LangChain's ```MarkdownHeaderTextSplitter```. This tool divides documents into meaningful sections based on Markdown headers, preserving the document's structure for systematic content processing.\n",
    "\n",
    "Context and structure of documents are crucial for effective text embedding. Simply dividing text isn't enough; maintaining semantic connections is key to generating more comprehensive vector representations. This is particularly true when dealing with large documents, as preserving context can significantly enhance the accuracy of subsequent analysis and search operations.\n",
    "\n",
    "The ```MarkdownHeaderTextSplitter``` splits documents according to specified header sets, managing the content under each header group as separate chunks. This enables efficient content processing while maintaining the document's structural coherence.\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Basic Usage of MarkdownHeaderTextSplitter](#basic-usage-of-markdownheadertextsplitter)\n",
    "- [Combining with Other Text Splitters](#combining-with-other-text-splitters)\n",
    "\n",
    "### References\n",
    "\n",
    "- [Langchain MarkdownHeaderTextSplitter](https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/)\n",
    "- [Langchain RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.\n",
    "\n",
    "**[Note]**\n",
    "- The ```langchain-opentutorial``` is a package of easy-to-use environment setup guidance, useful functions and utilities for tutorials.\n",
    "- Check out the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langsmith\",\n",
    "        \"langchain\",\n",
    "        \"langchain_core\",\n",
    "        \"langchain_community\",\n",
    "        \"langchain_text_splitters\",\n",
    "        \"langchain_openai\",\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"MarkdownHeaderTextSplitter\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can set and load ```OPENAI_API_KEY``` from a ```.env``` file.\n",
    "\n",
    "**[Note]** This is only necessary if you haven't already set ```OPENAI_API_KEY``` in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage of MarkdownHeaderTextSplitter\n",
    "\n",
    "The ```MarkdownHeaderTextSplitter``` splits Markdown-formatted text based on headers. Here's how to use it:\n",
    "\n",
    "- First, the splitter divides the text based on standard Markdown headers (#, ##, ###, etc.).\n",
    "- Store the Markdown you want to split in a variable called markdown_document.\n",
    "- You'll need a list called ```headers_to_split_on```. This list uses tuples to define the header levels you want to split on and what you want to call them.\n",
    "- Now, create a ```markdown_splitter``` object using the ```MarkdownHeaderTextSplitter``` class, and give it that ```headers_to_split_on``` list.\n",
    "- To actually split the text, call the ```split_text``` method on your ```markdown_splitter``` object, passing in your ```markdown_document```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Title\n",
      "\n",
      "## 1. SubTitle\n",
      "\n",
      "Hi this is Jim\n",
      "\n",
      "Hi this is Joe\n",
      "\n",
      "### 1-1. Sub-SubTitle \n",
      "\n",
      "Hi this is Lance \n",
      "\n",
      "## 2. Baz\n",
      "\n",
      "Hi this is Molly\n"
     ]
    }
   ],
   "source": [
    "from langchain_text_splitters import MarkdownHeaderTextSplitter\n",
    "\n",
    "# Define a markdown document as a string\n",
    "markdown_document = \"# Title\\n\\n## 1. SubTitle\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n### 1-1. Sub-SubTitle \\n\\nHi this is Lance \\n\\n## 2. Baz\\n\\nHi this is Molly\"\n",
    "print(markdown_document)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hi this is Jim  \n",
      "Hi this is Joe\n",
      "{'Header 1': 'Title', 'Header 2': '1. SubTitle'}\n",
      "=====================\n",
      "Hi this is Lance\n",
      "{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}\n",
      "=====================\n",
      "Hi this is Molly\n",
      "{'Header 1': 'Title', 'Header 2': '2. Baz'}\n",
      "=====================\n"
     ]
    }
   ],
   "source": [
    "headers_to_split_on = [  # Define header levels and their names for document splitting\n",
    "    (\n",
    "        \"#\",\n",
    "        \"Header 1\",\n",
    "    ),  # Header level 1 is marked with '#' and named 'Header 1'\n",
    "    (\n",
    "        \"##\",\n",
    "        \"Header 2\",\n",
    "    ),  # Header level 2 is marked with '##' and named 'Header 2'\n",
    "    (\n",
    "        \"###\",\n",
    "        \"Header 3\",\n",
    "    ),  # Header level 3 is marked with '###' and named 'Header 3'\n",
    "]\n",
    "\n",
    "# Create a MarkdownHeaderTextSplitter object to split text based on markdown headers\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "# Split markdown_document by headers and store in md_header_splits\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "# Print the split results\n",
    "for header in md_header_splits:\n",
    "    print(f\"{header.page_content}\")\n",
    "    print(f\"{header.metadata}\", end=\"\\n=====================\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Header Retention in Split Output\n",
    "\n",
    "By default, the ```MarkdownHeaderTextSplitter``` removes headers from the output chunks.\n",
    "\n",
    "However, you can configure the splitter to retain these headers by setting ```strip_headers``` parameter to ```False```.\n",
    "\n",
    "Example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Title  \n",
      "## 1. SubTitle  \n",
      "Hi this is Jim  \n",
      "Hi this is Joe\n",
      "{'Header 1': 'Title', 'Header 2': '1. SubTitle'}\n",
      "=====================\n",
      "### 1-1. Sub-SubTitle  \n",
      "Hi this is Lance\n",
      "{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}\n",
      "=====================\n",
      "## 2. Baz  \n",
      "Hi this is Molly\n",
      "{'Header 1': 'Title', 'Header 2': '2. Baz'}\n",
      "=====================\n"
     ]
    }
   ],
   "source": [
    "markdown_splitter = MarkdownHeaderTextSplitter(\n",
    "    # Specify headers to split on\n",
    "    headers_to_split_on=headers_to_split_on,\n",
    "    # Set to keep headers in the output\n",
    "    strip_headers=False,\n",
    ")\n",
    "# Split markdown document based on headers\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "# Print the split results\n",
    "for header in md_header_splits:\n",
    "    print(f\"{header.page_content}\")\n",
    "    print(f\"{header.metadata}\", end=\"\\n=====================\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining with Other Text Splitters\n",
    "\n",
    "After splitting by Markdown headers, you can further process the content within each Markdown group using any desired text splitter.\n",
    "\n",
    "In this example, we'll use the ```RecursiveCharacterTextSplitter``` to demonstrate how to effectively combine different splitting methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Intro \n",
      "\n",
      "## History \n",
      "\n",
      "Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n",
      "\n",
      "Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n",
      "\n",
      "## Rise and divergence \n",
      "\n",
      "As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n",
      "\n",
      "additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n",
      "\n",
      "#### Standardization \n",
      "\n",
      "From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n",
      "\n",
      "# Implementations \n",
      "\n",
      "Implementations of Markdown are available for over a dozen programming languages.\n"
     ]
    }
   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "markdown_document = \"# Intro \\n\\n## History \\n\\nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n## Rise and divergence \\n\\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n#### Standardization \\n\\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n# Implementations \\n\\nImplementations of Markdown are available for over a dozen programming languages.\"\n",
    "print(markdown_document)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, use ```MarkdownHeaderTextSplitter``` to split the Markdown document based on its headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Intro  \n",
      "## History  \n",
      "Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]  \n",
      "Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.  \n",
      "## Rise and divergence  \n",
      "As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \n",
      "additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  \n",
      "#### Standardization  \n",
      "From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "# Implementations  \n",
      "Implementations of Markdown are available for over a dozen programming languages.\n",
      "{'Header 1': 'Implementations'}\n",
      "=====================\n"
     ]
    }
   ],
   "source": [
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),  # Specify the header level and its name to split on\n",
    "    # (\"##\", \"Header 2\"),\n",
    "]\n",
    "\n",
    "# Split the markdown document based on header levels\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(\n",
    "    headers_to_split_on=headers_to_split_on, strip_headers=False\n",
    ")\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "# Print the split results\n",
    "for header in md_header_splits:\n",
    "    print(f\"{header.page_content}\")\n",
    "    print(f\"{header.metadata}\", end=\"\\n=====================\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we'll further split the output of the ```MarkdownHeaderTextSplitter``` using the ```RecursiveCharacterTextSplitter```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Intro  \n",
      "## History\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "readers in its source code form.[9]\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.  \n",
      "## Rise and divergence\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  \n",
      "#### Standardization\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.\n",
      "{'Header 1': 'Intro'}\n",
      "=====================\n",
      "# Implementations  \n",
      "Implementations of Markdown are available for over a dozen programming languages.\n",
      "{'Header 1': 'Implementations'}\n",
      "=====================\n"
     ]
    }
   ],
   "source": [
    "chunk_size = 200  # Specify the size of each split chunk\n",
    "chunk_overlap = 20  # Specify the number of overlapping characters between chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
    ")\n",
    "\n",
    "# Split the document into chunks by characters\n",
    "splits = text_splitter.split_documents(md_header_splits)\n",
    "# Print the split results\n",
    "for header in splits:\n",
    "    print(f\"{header.page_content}\")\n",
    "    print(f\"{header.metadata}\", end=\"\\n=====================\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "crawler-app",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
